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Abstract 

In this paper, the framework of kernel machines with two layers is in- 
troduced, generalizing classical kernel methods. The new learning method- 
ology provide a formal connection between computational architectures 
with multiple layers and the theme of kernel learning in standard regu- 
larization methods. First, a representor theorem for two-layer networks is 
presented, showing that finite linear combinations of kernels on each layer 
are optimal architectures whenever the corresponding functions solve suit- 
able variational problems in reproducing kernel Hilbert spaces (RKHS). 
The input-output map expressed by these architectures turns out to be 
equivalent to a suitable single-layer kernel machines in which the kernel 
function is also learned from the data. Recently, the so-called multiple 
kernel learning methods have attracted considerable attention in the ma- 
chine learning literature. In this paper, multiple kernel learning methods 
are shown to be specific cases of kernel machines with two layers in which 
the second layer is linear. Finally, a simple and effective rrmltiple kernel 
learning method called RLS2 (regularized least squares with two layers) is 
introduced, and his performances on several learning problems are exten- 
sively analyzed. An open source MATLAB toolbox to train and validate 
RLS2 models with a Graphic User Interface is available. 

1 Introduction 

Learning by minimizing costs in functional spaces has proven to be an im- 
portant approach to better understand many estimation problems. Indeed, the 
functional analytic point of view is the theoretical core of many successful learn- 
ing methodologies such as smoothing splines, Gaussian processes, and support 
vector machines [54, 18, 41, 52, 45, 37], collectively referred to as kernel meth- 
ods. One of the most appealing properties of kernel methods is optimality 
according to a variety of representer theorems. These results are usually pre- 
sented within the theory of RKHS [5] , and formalize the intuition that optimal 
learning machines trained with a finite number of data must be expressed by 
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a finite number of parameters, even when the hypothesis space is infinite di- 
mensional. Representer theorems have been generahzed and analyzed in many 
forms [10, 40, 49, 12, 31, 53, 14, 4], since their first appearance [26]. 

Recently, has been pointed out that standard kernel machines are somehow 
limited in their ability to approximate complex functional classes, as a conse- 
quence of being shallow architectures. In addition, existing representer theorems 
only apply to single-layer architectures, though an extension of the theory to 
include multi-layer networks would be useful to better understand the behavior 
of current multi-layer architectures and characterize new methods with more 
flexible approximation capabilities. Such extension is also suggested by com- 
plexity theory of circuits [8] as well as by biological motivated learning models, 
[43]. In the field of kernel methods, the need for complex hypothesis spaces 
reflecting "broad" prior knowledge has led to the idea of learning the kernel 
from empirical data simultaneously with the predictor [7, 27, 33, 30, 3, 55, 32]. 
Indeed, the difficulty of choosing a good hypothesis space with little available 
a-priori knowledge is significantly reduced when the kernel is also learned from 
the data. The fiexibility of algorithms implementing the framework of kernel 
learning makes also possible to address important machine learning issues such 
as feature selection, learning from heterogeneous sources of data, and multi- 
scale approximation. A major extension to the framework of classical kernel 
methods, based on the concept of hyper-kernels, has been introduced in [33], 
encompassing many convex kernel learning algorithms. In this paper, the con- 
nection between learning the kernel in standard (single layer) kernel machines 
and learning in multi-layer architectures is analyzed. We introduce the frame- 
work of kernel machines with two layers, that also encompasses classical single 
layer kernel methods as well as many kernel learning algorithms. 

Consider a generic architecture whose input-output behavior can be de- 
scribed as a function composition of two layers 

/ = /2o/i, h-.x^z, h-.z^Y, (1) 

where X is a generic set while Z and Y are two Hilbert spaces. The problem of 

learning simultaneously the two layers /i and /2 from input-outpiit data pairs 
{xi,yi) € X X Y can be formalized in a functional analytic setting. Indeed, 
in section 2 it is shown that a representer theorem holds: even if /i and /2 
arc searched into infinite dimensional function spaces, optimal learning archi- 
tectures are finite linear combination of kernel functions on each layer, where 
optimality is measured according to general regularization functionals. Remark- 
ably, such representer theorem also imply that, upon training, architecture (1) 
can be equivalently regarded as a standard kernel machine in which the kernel 
function has been learned from the data. After discussing the general result on 
the solution representation for two non-linear layers, the attention is focused 
on the case in which the second layer is linear. In section 3, we introduce a 
regularization framework that turns out to be equivalent to a general class of 
methods to perform multiple kernel learning, that is the simultaneous super- 
vised learning of a predictor and the associated kernel as a convex combination 
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of basis kernels. The general problem of learning the kernel is receiving a lot 
of attention in recent years, both from functional analytic point of view and 
from pure optimization perspectives. Since the earlier works [27, 30] , many im- 
proved optimization schemes have been proposed [2, 46, 32, 35, 6]. In section 4, 
a method called RLS2 (regularized least squares with two layers) based on reg- 
ularized multiple kernel learning with the square loss function is introduces and 
studied. Along the line of recent advances in multiple kernel learning [46, 35, 6], 
it is shown that the involved optimization can be efficiently carried out using 
a two-step procedure. For RLS2, two-step optimization turns out to be espe- 
cially simple and computationally appealing, alternating between the solution 
of a linear system and a constrained least squares problem. The application 
of RLS2 on a variety of learning problems is analyzed in section 5. State of 
the art generalization performances are achieved on several datasets, including 
multi-class classification of genomic data. An open source MATLAB toolbox to 
train and validate RLS2 models with a Graphic User Interface is available at 
http://www.mloss.org. AH the proof of Theorems and Lemmas are given in 
the Appendix. 

2 A representer theorem for architectures with 
two layers 

A learning architecture with two layers can be formalized as a map f : X ^ Y 
expressed as a function composition as in equation (1). Introduce two RKHS 

'Hi and 'H2 of vector- valued functions [31] defined over X and Z respectively, 
with (operator- valued) kernel functions and K^, and consider the following 
problem: 

Problem 1 

e 

mm L, iih o h){x,)) + RAWhWu,) + i?2(||/2|kJ. 

Here, Li -.Y ^ IR+ are loss functions measuring the approximation of train- 
ing data, while i?i,i?2 : ^+ — >■ K-t- U {-|-oo} are two extended- valued non- 
decreasing functions (not identically -|-oo) that play the role of regularization 
terms. Problem 1 is outside the scope of standard representer theorems [40] 
due to the presence of the composition (/2 0/1). Nevertheless, it still holds that 
linear combinations of a finite number of kernel functions are optimal solutions, 
as soon as there exist minimizers. 

Theorem 1 If the functional of Problem 1 admit minimizers, then there exist 
optimal solutions of Problem 1 in the form 

I t 
/i(x) =Y,K\x,x,)a,, f2{z) = Y,K^{fi{xi),z)h. 

i=l j=l 
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Therefore, there exists optimal learning architectures in the following input- 
output form: 



/(^) = (/2o/i)(a;)=^K(xi,a;)6i, K{xuX2) := K\f^{x^), f^{x2)). (2) 



Theorem 1 is a restriction theorem: the search for solutions of Problem 1 can 
be restricted to kernel machines with two layers involving a finite number of 
kernel functions, even when "Hi and 'H2 are infinite dimensional spaces. Notice 
that Theorem 1 is not an existence theorem, since existence of minimizers is 
one of the hypotheses. As shown in the next sections, existence can be ensured 
under mild additional conditions on Li,Ri,R2. Under the general hypothesis 
of Theorem 1, uniqueness of minimizers in Problem 1 is also not guaranteed, 
even when loss functions Li are strictly convex. Notice also that Theorem 
1 do admit the presence of optimal solutions not in the form of finite kernel 
expansions. However, if such solutions exist, then their projections over the 
finite dimensional span of kernel sections are optimal as well, so that one can 
restrict the attention to kernel machines with two layers also in this case. Finally, 
when Ri and R2 arc strictly increasing, it holds that every optimal solution of 
Problem 1 can be expressed as a kernel machine with two layers. 

3 Multiple kernel learning as a kernel machine 
with two layers 

Theorem 1 shows that training an architecture with two layers is equivalent to 
train simultaneously a single-layer kernel network and the kernel function, see 
equation (2). In this section, it is shown that multiple kernel learning, consisting 
in simultaneous learning of a finite linear combination of kernels and the asso- 
ciated predictor, can be interpreted as a specific instancejof kernel architecture 
with two layers. Introduce a set of m positive kernels Ki defined on X x X, 
called basis kernels and consider the following choice for Hi and ^2- 

• Til is an RKHS of vector valued functions f : X ^ associated with 
the matrix-valued kernel function such that 



• 'H2 is the RKHS of real valued functions / : — )• R associated with the 
linear kernel 




K2 {Z1,Z2) 

where 5 is a diagonal scaling matrix: 



z'[Sz2, 



S = diag {si 



Sm} > 0. 
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For any / G Hi, let p, (i = 1, . . . , m) denote its components. Introduce the 
indicator function / of the interval [0, 1] defined as 



I{x) 



0, < X < 1 

+0O, X > 1 



let Li -.Y ^ 1R+ denote lower semi-continuous convex loss functions and A > 0. 

In the following, wc; analyze the particular case of Problem 1 in which Ri is a 
square regularization and R2 is the indicator function regularization: 

A o 



Ri{x) 



R2{x) = I{x). 



Problem 2 



mm 

he-Hi, 

/2 6«2 



J2Lii{f2oh){xi)) + -\\hfn^+I{\\M\n.) 



Let's briefly discuss the choice of regularizers. First of all, notice that both Ri 
and i?2 are convex functions. Since Li are convex loss functions and /2 is linear, 
the problem is separately convex in both /i and f2- Apparently, regularizing 
with the indicator function /(II/2IIW2) i'' equivalent to impose the constraint 
11/2 11^2 — 1- Lemma 1 below shows that minimization into the unitary ball can 
be carried out without any loss of generality. 

Lemma 1 Let (/i,/!) denote an optimal solution of the following problem: 



mm 
/2e-H2 l-»=i 



\.f2\\n2 



Then, (/i,/2) ~ (/^/i,/!//?) (m o,n optimal solution of Problem 2 with A = 
a//9^ and satisfy 

./ = ./2 0/i -/2*o/*. (3) 

Thanks to the scaling properties coming from the linearity of the second 
layer and the use of the indicator function, the introduction of an additional 

regularization parameter can be avoided, thus significantly reducing the com- 
plexity of model selection. The next Theorem characterizes optimal solutions 
of Problem 2. 

Theorem 2 There exist optimal solutions f\ and f2 of Problem 2 in the form 



fl{x) = SiWi^CjKi{Xj,x), f2{z) = z^Sw. 
i=i 

Letting di := Siwf, optimal coefficients (c, d) solves the multiple kernel learning 
Problem 3 below, where 



Qiz) :=^L,(z,). 



(4) 
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Finally, the solution of Problem 2 can be written as in equation (2), where the 
kernel K satisfies 

m it 

K{x,y) = ^d^K^{x,y), Ki{x,y) = X! ^Jl^32^'^^i^n'^)Ki{^j2,y)■ 

i=l ji=lj2=l 

(5) 



Problem 3 



subject to 

rri m 

R%=SkKk{x,,Xj), i^(d) = ^cifei?^ 4>0, ^4<1- (6) 

fe=i fe=i 

Theorem 2 shows that the variational Problem 2 for a two-layer kernel ma- 
chine is equivalent to the multiple kernel learning Problem 3. The non-negativity 
constraints > leads to a sparse selection of a subset of basis kernels. In 
standard formulations, multiple kernel learning problems feature the equality 
constraint X]jt=i '^■k = 1, instead of the inequality in (6). Nevertheless, Lemma 
2 below shows that there always exist optimal solutions of Problem 4 satisfying 
the equality, so that the two optimization problems are equivalent. 

A few comments on certain degeneracies in Problem 2 are in order. First 
of all, observe that the absolute value of optimal coefficients Wi characterizing 
the two layer kernel machine is given by \wi\ = \fdifsi, but sign(wj) is undeter- 
mined. Then, without loss of generality, it is possible to choose wi = \/ dij Si. 
Second, observe that the objective functional of Problem 3 depends on c through 
the product Rc. When R is singular, the optimal vector c is not unique (inde- 
pendently of Q). In particular, if v belongs to the null space of i?, then c + 7W 
achieves the same objective value of c, for any 7 G M. This case also occurs in 
standard (single-layer) kernel methods. One possible way to break the indeter- 
mination, again without any loss of generality, is to constrain c to belong to the 
range of R. With such additional constraint, there exists z such that 

c = R^z, (7) 

where f denote the Moore-Penrose pseudo-inverse (notice that, in general, z 
might be different from Rc). Remarkably, the introduction of such change of 
variable makes also possible to derive an addition formulation of Problem 3, 
which can be shown to be a convex optimization problem. Indeed, by rewriting 
Problem 3 as a function of (^;, d), the following problem is obtained: 



Problem 4 



mm 



Q{z) + ^z^I^(d)z^, subject to (6). 
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Lemma 2 Problem 4 is a convex optimization problem and there exists an op- 
timal vector d satisfying the equality constraint 

J2dk = l- (8) 
fe=l 

Lemma 2 completes the equivalence between the specific kernel machines 
with two layers obtained by solving Problem 2 and multiple kernel learning 
algorithms. The Lemma also gives another important insight into the structure 
of Problem 3: local minimizers arc also global minimizers, a property that 
directly transfer from Problem 4 through the change of variable (7). 



3.1 Linear machines 

In applications of standard kernel methods involving high-dimensional input 
data, the linear kernel on 

K{xi,X2) = x1x2 (9) 

plays an important role. Optimization algorithms for linear machines are being 

the subject of a renewed attention in the literature, due to some important ex- 
perimental findings. First, it turns out that linear models are already enough 
flexible to achieve state of the art classification performances in application 
domains such as text document classification, word-sense disambiguation, and 
drug design, see e.g. [24]. Second, linear machines can be trained using ex- 
tremely efiicient and scalable algorithms [23, 44, 16]. Finally, linear methods 
can be also used to solve certain non-linear problems (by using non-linear fea- 
ture maps), thus ensuring a good trade-off between flexibility and computational 
convenience. 

Linear kernels arc also meaningful in the context of multiple kernel learning 
methods. Indeed, when^the input set X is a subset of M^, a possible choice for 
the set of basis kernels Kk is given by linear kernels on each component: 

Kk{x^,X2)=x\xl (10) 

Such a choice makes the input output map (2) a linear function: 

m / I \ 

where 

^2 Cjxi- (12) 

i=l 

Here, an important benefit is sparsity in the vector of weights a, that follows im- 
mediately from sparsity of vector d. In this way, linear multiple kernel learning 
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algorithms can simultaneously perform regularization and linear feature selec- 
tion. Such property is apparently linked to the introduction of the additional 
layer in the architecture, since standard kernel machines with one layer are not 
able to perform any kind of automatic feature selection. From the user's point 
of view, linear kernel machines with two layers behave similarly to sparse ii 
regularization methods such as the Lasso [50], performing feature selection by 
varying with continuity a shrinking parameter. However, it seems that £i reg- 
ularization methods cannot be interpreted as kernel machines (not even with 
two layers) and these two classes of algorithms are thus distinct. An instance of 
linear regularization methods with two layers is proposed in subsection 4.2 and 
analyzed in the experimental section 5. 

4 Regularized least squares with two layers 



Algorithm 1 Alternate optimization for RLS2 

i <(- a.Tgmaxk=i,...,my'^ R'^y 

d ^ Ci 

while (stopping criterion is not met) do 

for j G B do 

R^ R + djR^ 
end for 

c ^ Solution of the linear system (R + XI) c = y 

for i = 1, . . . , m do 

Vi -e- R^c 
end for 

d <— Solution of Problem (7). 
B^{j: d, + 0} 
end while 



In the previous section, a general class of convex optimization problems to 
learn finite linear combinations of kernels is shown to be equivalent to a two-layer 
kernel machine. As for standard kernel machines, different choices of loss func- 
tions Li lead to a variety of learning algorithms. For instance, from the results 
of the previous section it follows that the two-layer version of standard Support 
Vector Machines with "hinge" loss functions Li{z) = (1 — i/iz)^ is equivalent to 
the SILP (Semi-Infinite Linear Programming) multiple kernel learning problem 
studied in [46] , whose solution can be computed, for instance, by using gradient 
descent or SimplcMKL [35]. 

In this section, attention is focussed on square loss functions Li{^z) = [yi — 
2)^/2 and the associated kernel machine with two layers. As we shall show, 
coefficients Cj and dj defining the architecture as well as the "equivalent input- 
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output kernel" K can be conipiitcd by solving a very simple optimization prob- 
lem. Such problem features the minimization of a quartic functional in (c, d), 
that is separately quadratic in both c and d. It is worth noticing that the square 
loss function can be; used to solve regression problems as well as classification 
ones. Indeed, generalization performances of regularized least squares classifiers 
have been shown to be comparable to that of Support Vector Machines on many 
dataset, see [38, 17] and references therein. 

Problem 5 (Regularized least squares with two layers (RLS2)) 

min ( - \\y — R(d)c\\^ + —c^R(d)c] , subject to (6). 
Let Am denote the standard (m — l)-simplex in R"*: 

A„:=|dGM™: d>0, ^<ii = l|- 

For any fixed d, Problem 5 is an unconstrained quadratic optimization problem 
with respect to c. It is then possible to solve for the optimal c* in closed form 
as a function of d: 

(m \ -1 

J2diR' + XIj y. (13) 

As shown in Lemma 3 below, Problem 5 can be reduced to the following Problem 

in d only. 

Problem 6 

min —v^c*(d), 

Lemma 3 The pair (c*, d*) is an optimal solution of Problem 5 if and only if 
equation (13) holds and d* is an optimal solution of Problem 6. 

Along the lines of recent developments in multiple kernel learning optimiza- 
tion [46, 35], we propose a two-step minimization procedure that alternates 
between kernel and predictor optimization. The specific structure of our prob- 
lem allows for exact optimization in each of the two phases of the optimization 
process. Let 

V:={v, ■■■ Vm) = { R^c ■■■ R^c), u:={y-^ 

For any fixed c, minimization with respect to d boils down to the following 
simplex-constrained least squares problem, as ensured by the subsequent Lemma 
4. 



Problem 7 



min II Vd — m|P. 
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Lemma 4 For any fixed c, the optimal coefficient vector d of Problem 5 can be 

obtained as the solution of Problem 7. 

Optimal coefficients can be computed using an iterative two-step procedure 
such as the Algorithm 1, that alternates between minimization with respect to 
c obtained through the solution of the hncar system (13), and the solution of 
the simplex-constrained least squares Problem 7 in d. The non-negativity con- 
straint induce sparsity in the vector d, and thus also in the input-output kernel 
expansion. To undc;rstand the; initialization of coefficients c and d in Algorithm 
1, consider the limiting solution of the optimization problem when the regular- 
ization parameter tends to infinity. Such solution is the most natural starting 
point for a regularization path because optimal coefficients can be computed in 
closed form. 

Lemma 5 The limiting solution of Problem 5 when A +oo is given by 
{coo,doo) = (0,6,) , i e arg max y^R^y. 

k=l,...,m 

As shown in subsection 4.3, the result of Lemma 5 can be also used to give 
important insights into the choice of the scaling S in the second layer. 

4.1 A Bayesian mciximum a posteriori interpretation of 
RLS2 

The equivalence between regularization Problem 2 and the multiple kernel learn- 
ing optimization Problem 8 can be readily exploited to give a Bayesian MAP 
(maximum a posteriori) interpretation of RLS2. To specify the probabilistic 
model, we need to put a prior distribution over the set of functions from X into 
M and define the data generation model (likelihood). In the following, A^(/i, cr^) 
denote a real Gaussian distribution with mean fi and variance cr^, GM{f,K) 
a Gaussian measure on the set of functions from X into with mean / and 
covariance function and U {ft) the uniform distribution in R™ over a set O 
of positive finite measure. Let / : X -> M be such that 

f = w^Sfi, 

where f\ : X ^ K™ is distributed according to a Gaussian measure over %i 
with zero mean and (matrix- valued) covariance function , and w is a random 
vector independent of /i, distributed according to an uniform distribution over 
the ellipsoid Eg := {u; e : vF Sw < 1}: 

/i~GM(0,JCi), w^U{Es). 

Regarding the likelihood, assume that the data set T) := {(a;,, yi)}f^j, is gener- 
ated by drawing pairs (xi,yi) independently and identically distributed accord- 
ing to the usual additive Gaussian noise model: 

yi\{xi,f)r~^N{f{xi),a'). 
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When Hi is a finite dimensional space, the MAP estimate of / is: 

r = w*^sf^, 

where maximize the posterior density: 

p{f\V)^p{V\f)pih)p{w). 

Specifically, we have 

2a2 



mm TT / {yi-w'^Sfi{xi)y 



i=l 

oc exp 



1, xGEs 
0, else 



It follows that w* e Es and, by taking the negative logarithm of p{f\T>), that 
the MAP estimate coincides with the solution of the regularization Problem 2 
with square loss functions and A = cr^. When Hi is an infinite-dimensional 
function space, the Gaussian measure prior for fi do not admit a probability 
density. Nevertheless, the regularization Problem 2 can be still recovered by 
understanding MAP estimate as a maximal point of the posterior probability 
measure, as described in [21]. 



4.2 Linear regularized leeist squeires with two layers 

As described in subsection 3.1, when the input set X is a subset of M™ the 
linear choice of basis kernels (10) produces linear machines with feature selection 
capabilities. First of all, recall that standard regularized least squares with the 
linear kernel (9) boils down to finite-dimensional Tikhonov regularization [51] 
also known as ridge regression [22]: 

^rnm{\\y-Haf + X\\af), 

where H £ M^^™ denote the matrix of inputs such that Hij = xl- Lemma 6 
below states that the linear version of RLS2 is equivalent to a "scaled" ridge 
regression problem, in which the optimal scaling is also estimated from the data. 
Let 

r(rf) := diag{sidi,...,s„rfTO}. (14) 

For any fixed d, let n{d) be the number of non-zero coefficients rfj, V{d) G 
■^n{d)xn(d) (jg^Q^Q |;J^q diagonal sub-matrix of F containing all the strictly positive 
coefficients. Moreover, let H denote the scaled sub-matrix of selected features 
H := HT. 



11 



Problem 8 



min Wy-Hzf + \\\T'^/'^{d)zf, subject to (14). 



Lemma 6 When basis kernels are as in (10), the optimal solution of Problem 
5 can be written as in (11)-(12), where {z,d) solves Problem 8. 

From Problem 8, one can easily see that when d is fixed to its optimal value, 
the optimal z in Problem 8 is given by the familiar expression: 



The result of Lemma 6 can be also used to give an interesting interpretation 
of linear RLS2. In fact, the coefficient Sidi can be interpreted as a quantity 
proportional to the inverse variance of the i-th. coefficient Zj. Hence, Problem 8 
can be seen as a Bayesian MAP estimation with Gaussian residuals, Gaussian 
prior on the coefficients and uniform prior over a suitable simplex on the vector 
of inverse coefficients's variances. 

It is useful to introduce a notion of "degrees of freedom", see e.g. [15, 20]. 
Degrees of freedom is an index more interpretable than the regularization pa- 
rameter, and can be also used to choose the regularization parameter according 
to tuning criteria such as Cp [28], AIC [1], BIG [42], GGV [11]. A general expres- 
sion for the effective degrees of freedom of non-linear kernel regression methods 
with one layer, based on the SURE (Stein's Unbiased Risk Estimator) approx- 
imation [48] has been recently derived in [13]. For linear RLS2, the following 
quantity seems an appropriate approximation of the degrees of freedom: 



Expression (15) corresponds to the equivalent degrees of freedom of a linear 
regularized estimator with regressors fixed to H and diagonal regularization 
AF. Notice that (15) neglects the non-linear dependence of matrix H on the 
output data and does not coincide with the SURE estimator of the degrees of 
freedom. Nevertheless, the property < d,f{\) < m holds, so that df can be 
conveniently used to interpret the complexity of the linear RLS2 model (see 
subsection 5.1 for an example). 

4.3 Choice of the scaling and feature/kernel selection 

The ability of RLS2 to select features or basis kernels is highly influenced by the 
scaling S in the second layer. In this subsection, we analyze certain scaling rules 
that are connected with popular statistical indices, often used as "filters" for 
feature selection [19]. Since the issue of scaling still needs further investigation, it 
is not excluded that new rules diflFerent from those mentioned in this subsection 
may work better on specific problems. 





(15) 
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A key observation is the following: according to Lemma 5, RLS2 with heavy 
regularization favors basis kernels that maximizes the quantity = y^R'^y, 
that represents a kind of alignment between the kernel R'^ and the outputs. 
This means that RLS2 tends to select kernels that are highly aligned with the 
outputs. Since each alignment is proportional to the scaling factor Sk, an 
effective way to choose the scaling is one that makes the alignment a meaningful 
quantity to maximize. First of all, we discuss the choice of scaling for the linear 
RLS2 algorithm introduces in subsection 4.2. The generalization to the case of 
non-linear basis kernels easily follows by analyzing the associated feature maps. 

In the linear case, we have R'' ^'^ , where x'' is the fc-th feature 

vector, so that 

Ak = Skiy'^x'^f. 

By choosing 

sk = {\\y\\\\x''\\)-\ 

the alignment becomes the squared cosine of the angle between the fc-th feature 
vector and the output vector: 

(y^ x^ \ ^ 

In particular, when the outputs y and the features x^ are centered to have zero 
mean, A^ coincides with the squared Pearson correlation coefficient between the 
outputs and the /c-th feature, also known as coefficient of determination. This 
means that RLS2 with heavy regularization selects the features that mostly 
correlates to the outputs. Since the term WyW^ is common to all the factors, one 
can also use 

Sk = wx^r^ (16) 

without changing the profile of solutions along a regularization path (though, 
the scale of regularization parameters is shifted). Observe that rule (16) may 
also make sense when data are not centered or centered around values other 
than the mean. In fact, for some datasets, performances are better without 
any centering (this is the case, for instance, of Experiment 1 in subsection 5.1). 
Notice also that (16) only uses training inputs whereas, in a possible variation, 
one can replace x^ with the vector containing values of the k-i\i feature for 
both training and test data (when available). The latter procedure sometimes 
work better than scaling using training inputs only, and will be referred to as 
transductive scaling in the following. For binary classification with labels ±1, 
the choice (16) with or without centering still make sense, but other rules are 
also possible. Let ^+ and i- denote the number of samples in the positive 
and negative class, respectively, and and m'^ denote the within-class mean 
values of the fc-th feature: 

< = ^ E ™- = ^ E ^^ 

i:Vi = l i:yi=-l 
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By choosing 

^ K)2 + (C7^)2' ^^^^ 

where cr^ and cr^ denote the within class standard deviations of the fc-th feature, 
one obtain 



When the two classes are balanced {£+ = £- = £/2), A}, boils down to a quantity 
proportional to the classical Fisher criterion (or signal-to-interference ratio): 

e (m^-m^)2 



4 ((T^)2 + (f7^)2- 



Rules (16) and (17) can bo generalized to the case of non-linear basis kernels, 
by observing that non-linear kernels can be always seen as linear upon mapping 
the data in a suitable feature space. A rule that generalizes (16) is the following 
[?, see e.g.]]Rakotomamonjy08: 



Sk=\^Kk{x,,x,)j , (18) 

that amounts to scale each basis kernel by the trace of the kernel matrix, and 
reduces exactly to (16) in the linear case. Also (18) can be applied with or 
without centering. A typical centering is normalization in feature space, that 
amounts to subtract l/^X^jj Kk{xi, xj) to the basis kernel Kk, before comput- 
ing (18). A transductive scaling rule can be obtained by extending the sum to 
both training and test inputs, iianicly computing the inverse trace of the overall 
kernel matrix, as in [27]. Finally, a non-linear generalization of (17) is given by 
the following: 



Sk 
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-i:yi=l 
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Kk{xi, Xj) 

j:yj = l ^ 
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5 Experiments 

In this section, the behavior of linear and non-linear RLS2 on several learning 
problems is analyzed. In subsection 5.1, an illustrative analysis of linear RLS2 is 
proposed, whose goal is to study the feature selection capabilities and the depen- 
dence on the regularization parameter of the algorithm in simple experimental 
settings. RLS2 with non-linear kernels is analyzed in subsection 5.2, where an 
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extensive benchmark on several regression and classification problems from UCI 
repository is carried out. Finally, multi-class classification of microarray data is 
considered in subsection 5.3. 

Computations are carried out in a Matlab environment and the sub-problem 
(Problem 7) is solved using an SMO-like (Sequential Minimal Optimization) al- 
gorithm [34] . Current implementation features conjugate gradient to solve linear 
systems and a sophisticated variable shrinking technique to reduce gradient com- 
putations. The stopping criterion for Algorithm 1 used in all the experiments 
is the following test on the normalized residual of linear system (13): 

\\iR + XI)c-y\\ <S\\y\\. 

The choice 6 = 10~^ turns out to be sufficient to make all the coefficients 
stabilize to a good approximation of their final values. A full discussion of 

optimization details is outside the scope of the paper. All the experiments have 
been run on a Core 2 Duo T7700 2.4 GHz, 800 MHz FSB, 4 MB L2 cache, 2 
GB RAM. 

5.1 Linear RLS2: illustrative experiments 

In this subsection, we perform two experiments to analyze the behavior of lin- 
ear RLS2. In the first experiment, a synthetic dataset is used to investigate the 
ability of linear RLS2 to perform feature selection. The dependence of general- 
ization performances of RLS2 and other learning algorithms on the training set 
size is analyzed by means of learning curves. The goal of the second experiment 
is to illustrate the qualitative dependence of coefficients on the regularization 
parameter and give an idea of the predictive potentiality of the algorithm. 

Experiment 1 (Binary strings data) In the first experiment, a synthetic 

Binary strings dataset has been generated: 250 random binary strings Xi G 
{0, 1}^"° are obtained by independently sampling each bit from a Bernoulli 
distribution with p = 0.5. Then, the outputs have been generated as 

yi = x\ + -H x\ + Ci, 

where ej ~ N{0,a'^) are small independent Gaussian noises with zero mean 
and standard deviation a = 0.01. In this way, the outputs only depend on the 
ffist three bits of the input binary string. The dataset has been divided into a 
training set of 150 input output pairs and a test set containing the remaining 
100 data pairs. We compare the RMSE (root mean squared error) learning 
curves obtained by varying the training set size using five different methods: 

1. RLS (regularized least squares) with "ideal" kernel: 

K{xi,X2) = xlxl + xlxl + xlxl. (19) 

2. RLS with linear kernel (9) (ridge regression). 
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Figure 1: Binary strings data: lower bounds of RMSE learning curves. The 
top plot shows test RMSE for training set sizes between 1 and 150 with five 
different methods: RLS with ideal kernel (see details in the text), RLS with 
linear kernel (ridge regression), RLS with Gaussian RBF kernel, linear RLS2 
and Lasso. The bottom plot is the "zoomed" version of the top plot for training 
set sizes between 1 and 30, for RLS with ideal kernel, linear RLS2 and Lasso. 
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3. RLS with Gaussian RBF kernel 

K{xuX2) = exp (^-0.01 ^i^^) . 

4. RLS2 with hnear basis kernels (10) and scaling (16). 

5. Lasso regression. 

The goal here is to assess the overall quality of regularization paths associated 
with different regularization algorithms, independently of model selection pro- 
cedures. To this end, we compute the RMSE on the test data as a function 
of the training set size and evaluate the lower bounds of the learning curves 
with respect to variation of the regiilarization parameter. Results are shown in 
Figure 3, whose top plot reports the lower bounds of learning curves for all the 
five algorithms with training set sizes between 1 and 150. Notice that all the 
five methods are able to learn, asymptotically, the underlying "concept" , up to 
the precision limit imposed by the noise, but methods that exploits coefficients 
sparsity are faster to reach the asymptotic error rate. Not surprisingly, the best 
method is RLS with the "ideal kernel" (19), which incorporates a strong prior 
knowledge: the dependence of the outputs on the first three bits only. Though 
knowing in advance the optimal features is not realistic, this method can be 
used as a reference. The slowest learning curve is that associated to RLS with 
Gaussian RBF kernel, which only incorporates a notion of smoothness. A good 
compromise is RLS with linear kernel, which uses the knowledge of linearity of 
the underlying function, and reaches a good approximation of the asymptotic 
error rate after seeing about 100 strings. The remaining two methods (Lasso and 
linear RLS2) incorporate the knowledge of both linearity and sparsity. They are 
able to learn the underlying concept after seeing only 12 examples, despite the 
presence of the noise. Since after the 12-th example Lasso and linear RLS2 are 
basically equivalent, is it interesting to see what happen for very small sample 
sizes. The bottom plot of Figure 3 is a zoomed version of the top plot with 
training set sizes between 1 and 30, showing only the learning curves for the 
three methods that impose sparsity. Until the 8-th example, the Lasso learning 
curve stays lower than the RLS2 learning curve. After the 8-th example, the 
RLS2 learning curve stays uniformly lower than the Lasso, indicating an high 
efficiency in learning noisy sparse linear combinations. Since the multiple kernel 
learning interpretation of RLS2 suggests that the algorithm is being learning the 
"ideal" kernel (19) simultaneously with the predictor, it might be interesting to 
analyze the asymptotic values of kernel coefficients rf;. Indeed, after the first 
12 training examples, RLS2 sets to zero all the coefficients di except the first 
three, which are approximately equal to 1/3. 

Experiment 2 (Prostate Cancer data) Linear RLS2 is applied to the 
Prostate Cancer dataset, a regression problem whose goal is to predict the 
level of prostate-specific antigen on the basis of a number of clinical measures 
in men who were about to receive a radical prostatectomy [47]. These data 
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Figure 2: Prostate Cancer data: 10-fold cross-validation prediction error 
curves and their standard errors bands estimated for linear RLS2. Model com- 
plexity increases from the right to the left. The vertical line corresponds to the 
least complex model within one standard error of the best. 



Table 1: Prostate Cancer data: comparison of RLS2 with other subset 
selection and shrinkage methods. Estimated coefhcients, test error and their 
standard error are reported. Results for methods other than RLS2 are taken 
from [20]. Blank entries corresponds to variables not selected. 



Term 


RLS2 


Best subset 


LS 


Ridge 


Lasso 


PGR 


PLS 


Intercept 


2.452 


2.477 


2.465 


2.452 


2.468 


2.497 


2.452 


LCAVOL 


0.544 


0.740 


0.680 


0.420 


0.533 


0.543 


0.419 


LWEIGHT 


0.207 


0.316 


0.263 


0.238 


0.169 


0.289 


0.344 


AGE 






-0.141 


-0.046 




-0.152 


-0.026 


LBPH 


0.104 




0.210 


0.162 


0.002 


0.214 


0.220 


SVI 


0.170 




0.305 


0.227 


0.094 


0.315 


0.243 


LCP 






-0.288 


0.000 




-0.051 


0.079 


GLEASON 






-0.021 


0.040 




0.232 


0.011 


PGG45 


0.064 




0.267 


0.133 




-0.056 


0.084 


Test error 
Std error 


0.454 
0.152 


0.492 
0.143 


0.521 
0.179 


0.492 
0.165 


0.479 
0.164 


0.449 
0.105 


0.528 
0.152 
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Figure 3: Prostate Cancer data: profiles of RLS2 coefficients witli respect to 
a continuous variation of tlie regularization parameter. Coefficients are plotted 
versus df{X), the approximate degrees of freedom. The vertical line corresponds 
to the value of A chosen in the validation phase. 
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arc used in the textbook [20] to compare different feature selection and shrink- 
age methods, and have been obtained from the web site http://www-stat. 
stanford.edu/ElemStatLearn/. Data have been preprocessed by normahzing 
all the inputs to zero mean and unit standard deviation. The datasct is divided 
into a training set of 67 examples and a test set of 30 examples. To choose 
the regularization parameter, the 10-fold cross-validation score has been com- 
puted for different values of A in the interval [lO""*, 10^] on a logarithmic scale. 
The scaling coefficients Si are chosen as in (16), thus normalizing each training 
feature to have unit norm. An intercept term equal to the average of training 
outputs has been subtracted to the outputs before estimating the other coeffi- 
cients. For each of the dataset splits, the MSE (mean squared error) has been 
computed on the validation data. Figure 2 reports average and standard error 
bands for validation MSE along a regularization path. Following [20], we pick 
the value of A corresponding to the least complex model within one standard 
error of the best validation score. 

In a second phase, the whole training set (67 examples) is used to compute 
the RLS2 solution with different values of A. Figure 3 reports the profile of 
RLS2 coefficients aj, see equation (12), along the whole regularization path as 
a function of the degrees of freedom defined as in (15). RLS2 does a continuous 
feature selection that may resemble that of the Lasso. However, the dependence 
of coefficients on the regularization parameter is rather complex and the profile 
in Figure 2 is not piecewise linear. In correspondence with the value of A chosen 
in the validation phase, RLS2 selects 5 input variables out of 8. Table 1 reports 
the value of coefficients estimated by RLS2 together with the test error and 
his standard error. For comparison. Table 1 also reports models and results 
taken from [20] associated with LS (Least Squares), Best subset regression. 
Ridge Regression, Lasso regression, PGR (Principal Component Regression), 
PLS (Partial Least Squares). The best model on these data is PGR, but RLS2 
achieves the second lowest test error by using only 5 variables. 

5.2 Non-lineeir RLS2: regression and classification bench- 
mcirk 

In this subsection, benchmark experiments on four regression and six classifi- 
cation problems from UCI repository are illustrated (Table 2). RLS2 has been 
run on 100 random dataset splits with two different training/test ratios: 60/40 
and 70/30. For each dataset split, an approximate regularization path with 30 
values of A on a logarithmic scale in the interval [lO~^, 10®] has been computed. 
To speed-up the regularization path computation, a warm-start technique is em- 
ployed: the value of A is iteratively decreased, while kernel-expansion coefficients 
di are initialized to their optimal values obtained with the previous value of A. 
Performances are measured by accuracy for classification and RMSE (root mean 
squared error) for regression. For each dataset split and value of the regulariza- 
tion parameter, the following quantities are computed: prediction performance 
on the test set, number of selected kernels (number of non-zero di), training 
time in seconds and number of iterations to compute the whole regularization 
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Tabic 2: Data sets used in the experiments. The first four are regression prob- 



lems while the last six are classification problems. 



Dataset 


Feature standardization 


Examples 


Features 


Kernels 


AUTO-MPG 


Yes 


392 


7 


104 


CPU 


Yes 


209 


8 


494 


Servo 


No 


167 


4 


156 


Housing 


Yes 


506 


13 


182 


Heart 


Yes 


270 


13 


182 


Liver 


No 


345 


6 


91 


Pima 


Yes 


768 


8 


117 


Ionosphere 


Yes 


351 


33 


442 


Wpbc 


Yes 


194 


34 


455 


Sonar 


No 


208 


60 


793 



Table 3: RLS2 regression and classification: average and standard deviation 
of test performance (RMSE for regression, accuracy for classification) over 100 
dataset splits. Results with two difFcrcnt training/tost ratio are reported: 60/40 
(first two columns), and 70/30 (last two columns). 



Dataset 


60/40 70/30 


AuTO-MPG 

CPU 

Servo 

Housing 


2.79(0.209) 2.72(0.224) 
21.8(11.3) 21.2(11.9) 
0.755(0.116) 0.696(0.152) 
3.61(0.465) 3.49(0.558) 


Heart 
Liver 

Pima 

Ionosphere 

Wpbc 

Sonar 


83.8(2.98) 84 (3.28) 
69(3.57) 69.8(3.79) 
76.7(1.92) 77.1(1.96) 
93.3(1.83) 93.5(1.93) 
76.7(3.71) 76.4(4.63) 
83.6(3.69) 86.1(4.52) 
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Validation RiUISE (100 random splits 70/30) 



Validation RMSE {100 random splits 70/30) 




Figure 4: RLS2 on the Auto-mpg (left) and the Cpu (right) dataset: RMSE on 
the test data (top), number of selected kernels (center), and number of iterations 
(bottom) along a regularization path. 
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Validation RiUISE (100 random splits 70/30) Validation RMSE (100 random splits 70/30) 




10"^ 10° 10^ 10"^ lo" 10^ 

A A 



Figure 5: RLS2 on the Servo (left) and the Housing (right) dataset: RMSE on 
the test data (top), number of selected kernels (center), and number of iterations 
(bottom) along a regularization path. 
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Figure 6: RLS2 on the Heart (left) and Liver (right) dataset: classification 
accuracy on the test data (top), number of selected kernels (center), and number 
of iterations (bottom) along a regularization path. 



24 




Figure 7: RLS2 on the Pima (left) and Ionosphere (right) dataset: classifica- 
tion accuracy on the test data (top), number of selected kernels (center), and 
number of iterations (bottom) along a regularization path. 
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Validation Accuracy (100 random splits 70/30) 



Validation Accuracy (100 random splits 70/30) 
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Selected kernels (100 random splits 70/30) 



Selected kernels (100 random splits 70/30) 




Number of iterations 



00 random splits 70/30) 





Number of iterations (100 random splits 70/30) 




Figure 8: RLS2 on the Wpbc (left) and Sonar (right) dataset: classification 
accuracy on the test data (top), number of selected kernels (center), and number 
of iterations (bottom) along a regularization path. 
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Table 4: RLS2 regression and classification: number of selected kernels in cor- 
respondence with the optimal value of A, number of iterations and training time 

in seconds to compute^ a rc^gularization path. 



Data.scI 


Split ratio 


Kernels 


Iterations 


Time (s) 


AUTO-MPG 


60/40 


21.3(2.64) 


78.8(2.16) 


7.2(1.22) 




70/30 


22.1(2.3) 


80.6(2.17) 


17.4(1.82) 


CPU 


60/40 


30.2(3.65) 


46.6(2.42) 


26.4(13) 




70/30 


31.9(3.43) 


47(1.97) 


29(13.8) 


Servo 


60/40 


11(1.37) 


71(2.26) 


1.95(0.214) 




70/30 


11.4(1.8) 


73.6(2.09) 


2.37(0.162) 


Housing 


60/40 


37.4(3.95) 


90.6(2.73) 


38.7(1.81) 




70/30 


38.7(3.31) 


92.4(2.43) 


51.9(2.07) 


Heart 


60/40 


13.7(2.03) 


75.9(2.33) 


4.22(0.234) 




70/30 


14.3(2.12) 


77.8(2.16) 


5.83(0.411) 


Liver 


60/40 


23.5(2.87) 


54.6(2.53) 


3.56(0.545) 




70/30 


15.1(2.11) 


55.1(2.43) 


4.51(0.399) 


Pima 


60/40 


14.6(2.17) 


84.7(2.67) 


53.6(2.23) 




70/30 


15.6(2.18) 


86.6(2.33) 


70.8(4.14) 


Ionosphere 


60/40 


61.6(6.03) 


70.7(3.13) 


16.7(0.96) 




70/30 


67.6(6.92) 


73.1(2.92) 


24.2(2.45) 


Wpbc 


60/40 


2.29(0.832) 


60.5(3.03) 


4.89(0.305) 




70/30 


2.13(0.906) 


60.6(2.42) 


6.85(0.198) 


Sonar 


60/40 


45.1(4.45) 


60.5(2.18) 


9.42(0.191) 




70/30 


35.9(3.5) 


61.6(1.9) 


11.4(0.227) 
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path. Datascts have been pre-processed by removing examples with missing fea- 
tures and converting categorical features to binary indicators. For some of the 
datasets (see Table 2) input features have been standardized to have zero mean 
and unitary standard deviation. For classification, output labels are ±1 and 
predictions are given by the sign of (/2 o fi)- For regression, an intercept term 
equal to the mean of training outputs is subtracted to the training data. Basis 
kernel matrices are pre-computcd and the scaling matrix S is chosen according 
to the rule (18) with transductive scaling. 

To better compare the results to similar benchmarks for multiple kernel 
learning, see e.g. [35], the same set of basis kernels for all the datascts has 
been chosen. We remark that such agnostic approach is not representative ^f 
a realistic application of the algorithm, in which the choice of basis kernels Kk 
should reflect a-priori knowledge about the learning task to be solved. The set 
of basis kernels contains the following: 

• Polynomial kernels 

Kk{xi,X2) = (1-^x^x2)" 

with n = 1,2,3. 

• Gaussian RBF kernels 

Kk{xi,X2) = exp (-711x1 - X2IP) , 

with 10 different values of 7 chosen on a logarithmic scale between 10~^ 
and 10^. 

Kernels on each single feature and on all the features are considered, so that the 
number of basis kernels is an affine function of the number of features (recall that 
categorical features have been converted to binary indicators). More precisely, 
we have m = 13{N + 1). 

All the profiles of test prediction performance, number of kernels and number 
of iterations for the 70/30 datasct split in correspondence with different values 
of the regularization parameter are reported in Figures 4-8. From the top plots, 
it can be seen that test performances are relatively stable to variations of the 
regularization parameter around the optimal value A*, indicating that RLS2 
is robust with respect to the use of different model selection procedures. For 
regression datasets such as Cpu, Servo, or Housing, optimal performances 
seems to be reached in correspondence with the un-regularized solution A — > 
0+. Lines in light color are associated with single dataset splits, while thick 
lines are the averages over different dataset splits. The vertical dotted line 
corresponds to the value of the regularization parameter with best average test 
performance. The average number of selected kernels vary quite smoothly with 
respect to the regularization parameter. For large values of A, RLS2 chooses 
only one basis kernel. For small values of A, the number of selected kernels 
grows and exhibits an higher variability. The bottom plots in Figures 4-8 give 
an idea of the computation burden required by alternate optimization for RLS2 
in correspondence with different values of A. In correspondence with high values 
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of the rcgularization parameter, the algorithm converges in a single iteration. 
This occurs also for the very first value on the regularization path, meaning that 
the initialization rule is effective. With low values of A, RLS2 also converges in 
a single iteration, since the second layer doesn't change much from an iteration 
to the next. 

Test performances for regression and classification are summarized in Ta- 
ble 3, where the average and standard deviation with respect to the 100 dataset 
splits of either RMSE (regression) or accuracy (classification) in correspondence 
with to the best value of A are reported. Performances of other kernel learning 
algorithms on some of these datasets can be found in [27, 33, 35] and references 
therein. Another benchmark study that might be useful for comparison is [29]. 
Comparisons should be handled with care due to the use of different experi- 
mental procedures and optimization problems. For instance, [27] uses an 80/20 
dataset split ratio, [33] uses 60/40, while [35] uses 70/30. Also, different num- 
bers of dataset splits have been used. Individuating what kind of datasets are 
better suited to what algorithm is a complex issue, that is certainly worth fur- 
ther investigation. These experiments shows that RLS2 results are competitive 
and complexity of the model is well controlled by regularization. In particu- 
lar, state of the art performances arc reached on Servo, Housing, Hearth, 
Pima, Ionosphere. Finally, it should be observed that, although multiple ker- 
nel learning machines have been used as black box methods, the use of basis 
kernels on single features sometimes also selects a subset of relevant features. 
Such property is remarkable since standard single-layer kernel methods are not 
able to perform "embedded" feature selection. 

Table 4 reports the average and standard deviation of number of selected 
kernels in correspondence with the optimal value of A, number of iterations and 
training time needed to compute a regularization path for all the regression and 
classification datasets studied in this subsection. From the columns of selected 
kernels, it can be seen that a considerable fraction of the overall number of ba- 
sis kernels is filtered out by the algorithm in correspondence with the optimal 
value of the rcgularization parameter. By looking at the number of iterations 
needed to compute the path with 30 values of the regularization parameter, 
one can see that the average number of iterations to compute the solution for 
a single value of A is in between 1 and 3, indicating that the warm-start pro- 
cedure is rather effective at exploiting the continuity of solutions with respect 
to the regularization parameter. As a matter of fact, most of the optimization 
work is spent in correspondence with a central interval of values of A, as shown 
in the bottom plots of Figures 4-8. Finally, from the last column, reporting 
average and standard deviation of training times, it can be seen that, with the 
current implementation of RLS2, regularization paths for all the datasets in 
this subsection can be computed in less than one minute in the average (see 
the introduction of this section for experimental details) . Although the current 
implementation of RLS2 has been designed to be efficient, it is believed that 
there's still considerable margin for further computational improvements. This 
issue may well be the subject of future developments. 
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Figure 9: 14 Cancers data: profiles of training error, 8-fold validation error 
and test error for different values of the regularization parameter. Standard error 
bands for the validation error are also shown. The vertical line corresponds to 
the least complex model maximizing the validation accuracy. 

5.3 RLS2: multi-class classification of microarray data 

RLS2 can be applied to multi-class classification problems by solving several 
binary classification problems and combining their outcomes. A possible way 
to combine binary classifiers is the OVA (one versus all) approach, in which 
each class is compared to all the others and test labels are assigned to the class 
maximizing the confidence (the real-valued output) of the corresponding binary 
classifier. 

Linear RLS2 with OVA has been applied to the 14 Cancers dataset [36], a 
delicate multi-class classification problem whose goal is to discriminate between 
14 different types of cancer, on the basis of microarray measurements of 16063 
gene expressions. Gene measurements and type of cancer (labels) are available 
for 198 patients, the dataset being already divided in a training set of 144 
patients, and a test set of 54 patients. Another important goal in this problem is 
to individuate a small subset of genes which is relevant to discriminate between 
the different kind of cancer. [20] reports several results for these data using 
a variety of classification methods. Algorithms such as the Support Vector 
Classifier uses all the genes to compute the classification boundaries, while others 
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Table 5: 14 Cancers data: average test error (see the text for details) and 
number of selected genes for different classification algorithms. For RLS2, the 
number of selected genes is relative to the least complex model maximizing the 
validation accuracy. Results for methods other than RLS2 are taken from [20] . 



Method Test errors Selected genes 



Support Vector Classifier 


14.0 


16,063 


Ll-penalized multinomial 


13.0 


269 


Lasso regression (OVA) 


12.5 


1,429 


L2-penalized discriminant analysis 


12.0 


16,063 


Elastic-net penalized multinomial 


11.8 


384 


Linear RLS2 (OVA) 


9.8 


855 



such as Lasso or Elastic Net are also able to select a subset of relevant genes. 
Since the feature selection experiment in subsection 5.1 suggests that RLS2 may 
be very efficient at selecting relevant features from noisy examples, a microarray 
dataset seems to be an appropriate choice for testing the algorithm. 

Gene expressions for each patient have been firstly standardized to have zero 
mean and variance one. For each binary classifier, coefficients Sj are chosen as 
Si = (cr^^ + ""-) : where o"^ and o"^ arc the within-class sample variances 
computed using all the training data. Such scaling gives more weight to genes 
whose expressions exhibits small within-class variability, and seems to slightly 
improve classification performances. A validation accuracy profile has been 
computed rising stratified 8-fold cross validation, where the folds are organized 
to approximately preserve the class proportions""^. For the final model, we pick 
the highest value of A maximizing the validation accuracy. Figure 9 reports 
the profiles of training accuracy, cross-validation accuracy with corresponding 
standard error bands, and test accuracy for 50 logarithmically spaced values 
of the regularization parameter. Table 5 reports the number of test errors and 
selected genes in correspondence with the value of A chosen in the validation 
phase, for RLS2 and other methods from [20] . Test errors in Table 5 are averages 
of test errors for different classifiers associated with all the different values of 
the regularization parameter that maximizes the cross-validation score (this 
explains the presence of non-integer values). For linear RLS2, such procedure 
yields a value of about 9.8. In correspondence with the least complex model 
maximizing the cross-validation accuracy, one obtain 10 test errors using 855 
genes. Although the test set size is too small to draw significative conclusions 
from this comparison, linear RLS2 seems to work rather well on this problem 
and achieve the best test performances. Such good performance also confirm 
effectiveness of the OVA multi-class approach for RLS2. 

^We thank Trevor Hastie for kindly providing the folds used in their experiments. 
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6 Conclusions 



The connection between learning with a two-layer network and the problem of 
learning the kernel has been analyzed. While architectures with more than one 
layer are justified by a representer theorem, an alternative perspective to look 
at the problem of kernel learning is proposed. Such perspective makes clear 
that these two methodologies aim both at increasing the approximation power 
of standard single layer methods by using machines that can adaptively select 
functions with a variety of shapes when little prior knowledge is available. In 
particular, the multiple kernel Iciarning framework is shown to be an important 
specific case of a more general two-layer architecture. We also introduce RLS2, a 
new method to perform multiple kernel learning based on regularization with the 
square loss function and alternate optimization. RLS2 exhibits state of the art 
performances on several learning problems, including multi-class classification 
of microarray data. An open source set of MATLAB scripts for RLS2 and linear 
RLS2 is available at http://www.mloss.org and also includes a Graphic User 
Interface. 
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Appendix (proofs) 



Proof of Theorem 1 

For any fixed /i, let Zi := fi(xi). By fixing an optimal /i, Problem 1 can 
be written as a function of /2 alone as 



min iY^Li{f2{zi)) + R2{\\f2\\n.) 
J2&H2 \ — ■ 



By standard representer theorems for vector valued functions (see [31] and the 
remark on monotonicity in [40] after Theorem 1), there exists an optimal /2 in 
the form 

e 

h{z) = Y,K\zi,z)bi. 

i=l 

Then, by fixing on optimal /2 in this form, Problem 1 can be written as a 
function of f\ alone as 



min (;El, {f^{x{)) + Rr{\\h\\n,) 



where Lj {z) := Lj {f2{z)). Notice that the new loss functions Lj depends on /2. 
Again, by the single-layer representer theorem the finite kernel expansion for 
/i follows. Finally, it is immediate to see that the overall input-output relation 
/2 o /i can be written as in (2). 
Proof of Lemma 1 

By linearity of /2, it is immediate to see that (3) holds. It follows that 

Y,L, ((/2 o = j^u ((/2 o n){x,)) . 



In addition, 

a 



In the last equation, we exploit the fact that jl{x) = I{x), for any positive 7, 
a property that is satisfied only by indicator functions. The thesis follows by 
letting A = a//32. 
Proof of Theorem 2 

Problem 2 is a specific instance of Problem 1 . The functional to minimize is 
bounded below, lower semi-continuous and radially-unbounded with respect to 
(/i,/2)- Existence of minimizers follows by weak-compactness of the unit ball 
in 'Hi and 'H2- By Theorem 1, there exists an optimal /i in the form 

I i 
^^^^{^^^3)0-3 ^^<^'^ag[Ki{x,Xj), . . . ,Km{x,Xj)^ aj. 
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Introduce the matrix A G M™^^ whose rows arc denoted by (c*)"^ and whose 
columns are aj. Then, the i-th component of /i can be written as: 

i 

By Theorem 1, there exists an optimal /2 such that 

e e I 

Hz) =Y,bjK\h{xi),z) =Y,bjZ^SMxj) = z^sY^b.Mxj) = z^Sw, 
j=i j=i j=i 

where 

e 

Letting matrices i?'^ e M.^^^ like in (6) and Q as in (4), Problem 2 can be 
rewritten as 



mm 



0K^«;,i?'=cM+^^^^^^ , subject 
\fe=i / fe=i ^'^ 



to w'^S'u; < 1. 



By optimizing with respect to vectors c% we have 

e -SiWiR'dQ {^WkR^J'] +XR'c\ 



where d is the sub-differential of a convex function [39]. Now, letting 



we obtain 

= SiWiC. 

Letting di := Siwf and R := J2iLi diR^, Problem 2 boils down to Problem 
3. By Theorem 1 again, the overall input-output relation can be written as in 
equation (2), where the kernel K satisfy 

m 

K{X„X2) = if2(/i(xi),/l(x2)) = /l(xi)^5/i(x2)=^sJKxi)/{(x2) 

i=l 

m it 
= X) Siwl ^ ^ Cj^Cj^k,{Xj^,Xi)K,{Xj^,X2) 



= '^diKi{xi,X2), 
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and Ki arc as in (5). 

Proof of Lemma 2 Problem 4 can be rewritten as 



mm 



Q{z) + -z 



(20) 



subject to (6), where S+ denotes the cone of m x m positive semi-definite 
matrices. This problem can be seen to be jointly convex in {z,R) using an 
argument due to [25]: the term z'^R^z is a matrix-fractional function (see e.g. 
[9], Example 3.4), which is a jointly convex function of the pair {z,R). This 
easily follows by noticing that its epi- graph is a convex set: 



z'^Rh <a <^ 



a 
z 



z 
R 



Since Q is a convex function, the overall functional (20) is convex. Minimization 
in (20) subject to linear constraints (6) is thus a convex optimization problem. 
Since i? is a linear function of d, Problem 4 is also convex. 

To prove (8), assume that {z* , d*) is an optimal pair for Problem 4. Without 
loss of generality, we can assume d* ^ 0. Indeed, if there's an optimal solution 
with d* = 0, then z = 0,d ^ is optimal as well. Now, let 7 := > 
notice that < 7 < 1. Introducing the new pair {z,d) = (z*,d*/'y), the value 
of the objective functional in correspondence with (z, d) is 

Q{z) + ^z''RHd)z = Q{z*) + ^{z*fRHd*)z* < Q{z*) + ^{z*fR\d*)z*, 

so that the new pair is optimal as well. 
Proof of Lemma 3 

By Lemma 2, minimization with respect to d can be restricted to the stan- 
dard simplex A^. For any fixed d, the functional of Problem 5 is a convex 
quadratic fimction of c. If c*{d) satisfy equation (13), then the gradient of the 
objective functional with respect to c is zero in c*, meaning that c* is optimal. 
Dropping the dependence on d, equation (13) can be rewritten as 

y-Rc* = Ac*. 

In correspondence with such optimal c* , we have 

A 



1 \\y - Rc*f + ^c*^Rc* = ^ lie* 

2 " " 2 2 " 



+ -c*^ (y-Ac*) 



^ T * 



Proof of Lemma 4 

By Lemma 2, minimization with respect to d can be restricted to the stan- 
dard simplex A„. In addition, we have 



l\\y-Rcf + ^c^Rc = \ 



u — Rc + 



Ac 



l\\u-Rcf 



+ ^c^Rc 
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where does not depend on R (and thus does not depend on d). Now, 

recahing that 

m 

R{d) = ^diR\ 

i=l 

we have 

m m 

i?c = ^ diR'c = ^ diVi = Vd. 

i=l i=l 

Proof of Lemma 5 

Prom equation (13), we have 

Coo = lim {R{d) + A/)"S = 0. 

X—>-+oo 

Since R{d) is a continuous function of d defined over the compact set A^, by 
fixing any matrix norm || ■ || there exists a suflaciently large A such that 

max ||-R(rf)|| < A. 

For A > A, the expansion 

{R{d)/X + 7)-' = 7 - R{d)/X + o(1/A2), 
holds. By Lemma 3, it follows that 

d*(A) = arg min j/^(7?(d)/A + 7)-S 

= arg mjn [A||j/f /2 - {y^R{d)y) + o(l/A)] 
= argmm [\\yf /2 - [y'^ R{d)y)/X + 0(1/ X^)] 
= arg max [y^ R{d)y — o{l / X)] . 
Hence, doo = lim;^^+(x) d* (A) solves the following linear program 

max y^^di{y'^ K'y). 

Then, it is easy to see that d = ek,k G argmaxj=i^..,^TO is an optimal 

solution of the hnear program, where k is any index maximizing the "kernel 
alignment" y^R^y. 
Proof of Lemma 6 

When basis kernel are chosen as in (10), f{x) can be written as in (ll)-(12), 
and we have 

R{d) = HTH'^. 
By letting z := 77^c, it follows that 

Rc = HTz = Hz 

jRc = Jhth^c = ||rV2^||2 = ||rV2j||2 

Hence, Problem 5 reduces to Problem 8. 
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